R has several core data structures:
Vectors form the basis of R data structures.
There are two main types, atomic vectors and lists, but I will treat lists separately.
Here is an R vector. The elements of the vector are numeric values.
x = c(1, 3, 2, 5, 4)
x
[1] 1 3 2 5 4
All elements of an atomic vector are the same type. Examples include logical, integer, double (numeric), and character vectors.
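A quick sketch of the common atomic types; typeof reports the type (the variable names here are illustrative):

```r
# each atomic vector has a single type, which typeof() reports
lgl = c(TRUE, FALSE)   # logical
int = c(1L, 2L, 3L)    # integer (the L suffix forces integer)
dbl = c(1.5, 2.5)      # double (numeric)
chr = c('a', 'b')      # character

typeof(lgl)  # "logical"
typeof(chr)  # "character"

# mixing types silently coerces to the most flexible type
typeof(c(1, 'a'))  # "character"
```

The coercion in the last line is why combining numbers and strings in one vector gives you all strings.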
An important type of vector is the factor. Factors are used to represent categorical data.
x = factor(1:3, labels=c('q', 'V', 'what the heck?'))
x
[1] q V what the heck?
Levels: q V what the heck?
While the underlying representation is numeric, factors are categorical, and so can’t be used as numbers would be.
as.numeric(x)
[1] 1 2 3
sum(x)
Error in Summary.factor(...) : 'sum' not meaningful for factors
When we move to multiple dimensions, we are dealing with arrays.
Matrices are 2-d arrays and are extremely commonly used.
In R, the vectors making up a matrix must all be of the same type.
Creating a matrix can be done in a variety of ways.
# create vectors
x = 1:4
y = 5:8
z = 9:12
rbind(x, y, z) # row bind
  [,1] [,2] [,3] [,4]
x    1    2    3    4
y    5    6    7    8
z    9   10   11   12
cbind(x, y, z) # column bind
     x y  z
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
matrix(c(x, y, z), nrow=3, ncol=4, byrow=TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
Lists in R are highly flexible objects that can contain anything as their elements, even other lists.
Here is an R list. We use the list function to create one.
x = list(1, "apple", list(3, "cat"))
x
[[1]]
[1] 1
[[2]]
[1] "apple"
[[3]]
[[3]][[1]]
[1] 3
[[3]][[2]]
[1] "cat"
We can use a loop to see the class of each element.
for (elem in x) print(class(elem))
Lists can be, and often are, named.
x = list("a" = 25, "b" = -1, "c" = 0)
x["b"]
$b
[1] -1
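A note on the extraction operators here: a single bracket returns a smaller list, while a double bracket (or $) returns the element itself. A small runnable sketch:

```r
x = list("a" = 25, "b" = -1, "c" = 0)

x["b"]           # single bracket: a one-element list
x[["b"]]         # double bracket: the value itself, -1
x$b              # same as x[["b"]]

class(x["b"])    # "list"
class(x[["b"]])  # "numeric"
```

This distinction matters most when you pass the result to a function that expects a plain vector rather than a list.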
Data frames are the most commonly used data structure.
Unlike matrices, their columns do not all have to be the same type.
This is because the data.frame class is actually just a list of equal-length columns.
As such, everything about lists applies to data.frames, but they can also be indexed by row and column like matrices.
mydf = data.frame(a = c(1,5,2),
                  b = c(3,8,1))
We can also add row names.
rownames(mydf) = paste0('row', 1:3)
mydf
     a b
row1 1 3
row2 5 8
row3 2 1
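Because a data.frame is a list of columns underneath, both list-style and matrix-style indexing work on the same object (mydf is recreated here so the sketch is self-contained):

```r
# a data.frame is a list of equal-length columns underneath
mydf = data.frame(a = c(1, 5, 2),
                  b = c(3, 8, 1))
rownames(mydf) = paste0('row', 1:3)

is.list(mydf)       # TRUE
length(mydf)        # 2, the number of columns, just as for a list
mydf[['a']]         # list-style column extraction
mydf$a              # same thing
mydf['row2', 'b']   # matrix-style row/column indexing: 8
```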
Standard methods of reading in data
Using the foreign package:
Note: the foreign package is no longer useful for Stata files.
haven: Package to read in foreign statistical files
readxl: for excel files
readr: Faster versions of base R functions
These functions make type assumptions after an initial scan of the data.
If you don’t have ‘big’ data, the speed gain won’t matter much.
However, they can also be useful as a diagnostic.
data.table: faster read.table
Typically faster than readr approaches.
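As a minimal, self-contained sketch of reading delimited data (base R only; readr::read_csv and data.table::fread take a file path the same way — the file and column names here are illustrative):

```r
# write a small CSV to a temporary file, then read it back
tmp = tempfile(fileext = '.csv')
write.csv(data.frame(id = 1:3, score = c(2.5, 3.1, 4.0)),
          tmp, row.names = FALSE)

dat = read.csv(tmp)              # base R
# dat = readr::read_csv(tmp)     # readr equivalent (assumes readr is installed)
# dat = data.table::fread(tmp)   # data.table equivalent (assumes data.table is installed)
str(dat)
```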
Note that R has many resources for dealing with other types of data.
Some examples:
And many, many others.
feather: designed to make reading and writing data frames efficient
Works in both Python and R.
Still in early stages of development.
Slicing vectors
letters[4:6]
[1] "d" "e" "f"
letters[c(13,10,3)]
[1] "m" "j" "c"
Slicing matrices/data.frames
myMatrix[1, 2:3]
Label-based indexing:
mydf['row1', 'b']
Position-based indexing:
mydf[1, 2]
Mixed indexing:
mydf['row1', 2]
If the row/column value is empty, all rows/columns are retained.
mydf['row1',]
mydf[,'b']
Non-contiguous:
mydf[c(1,3),]
Boolean:
mydf[mydf$a >= 2,]
List/data.frame extraction
[ : grab a slice of elements/columns
[[ : grab specific elements/columns
$ : grab specific elements/columns
my_list_or_df[2:4]
my_list_or_df[['name']]
my_list_or_df$name
We can take values corresponding to some operation that results in TRUE or FALSE.
Assume x is a vector of numbers.
idx = x > 2
idx
x[idx]
We actually don’t have to create a Boolean object before using it.
R indexing is ridiculously flexible.
x[x > 2]
x[x != 3]
x[ifelse(x > 2, T, F)]
x[{y = idx; y}]
Consider the following loop:
for (i in 1:nrow(mydf)) {
  check = mydf$x[i] > 2
  if (check == TRUE) {
    mydf$y[i] = 'Yes'
  } else {
    mydf$y[i] = 'No'
  }
}
Compare:
mydf$y = 'No'
mydf$y[mydf$x > 2] = 'Yes'
This gets us the same thing.
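The same vectorized replacement can also be written in a single call with ifelse; a small self-contained sketch (the data here is illustrative):

```r
# illustrative data
mydf = data.frame(x = c(1, 3, 2, 5))

# one vectorized call instead of a loop or two assignments
mydf$y = ifelse(mydf$x > 2, 'Yes', 'No')
mydf$y  # "No" "Yes" "No" "Yes"
```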
Boolean indexing provides an example of a vectorized operation.
The whole vector is considered rather than each element individually.
This is almost always much faster.
Log all values in a matrix.
mymatrix_log = log(mymatrix)
This would be a lot faster than looping over elements, rows, or columns.
Many vectorized functions already exist in R.
They are also often written in C, Fortran etc., and so even faster.
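For instance, a few built-in vectorized functions that operate on a whole matrix at once:

```r
m = matrix(1:6, nrow = 2)  # a 2 x 3 matrix, filled by column

colMeans(m)  # 1.5 3.5 5.5
rowSums(m)   # 9 12
log(m)       # elementwise log; no loop needed
```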
In R there is a family of functions that allows for a succinct way of looping.
Standardizing variables
for (i in 1:ncol(mydf)) {
  x = mydf[, i]
  for (j in 1:length(x)) {
    x[j] = (x[j] - mean(x)) / sd(x)
  }
}
This would be a really bad way to use R.
stdize <- function(x) {
  (x - mean(x)) / sd(x)
}
apply(mydf, 2, stdize)
Unit: milliseconds
expr min lq mean median uq max neval
doubleloop 3404.268719 3483.89474 3516.43856 3529.94528 3554.93860 3598.25409 25
singleloop 33.967538 35.81762 37.70332 37.34829 38.70666 43.65861 25
plyr 141.248976 148.02459 157.58312 150.56991 163.47453 198.68966 25
apply 36.181748 39.19026 40.73384 40.04221 41.64133 47.46504 25
vectorized 8.059534 10.46707 13.50199 11.19833 12.62214 53.21460 25
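A runnable version of the standardization using apply and sapply (the data.frame here is illustrative):

```r
stdize = function(x) (x - mean(x)) / sd(x)

set.seed(123)  # illustrative data
mydf = data.frame(a = rnorm(100), b = rnorm(100))

std1 = apply(mydf, 2, stdize)  # over columns; returns a matrix
std2 = sapply(mydf, stdize)    # same numbers, via the list of columns
round(colMeans(std1), 10)      # each column now has mean 0 and sd 1
```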
Benefits
NOT necessarily faster than explicit loops.
But they CAN potentially be faster than loops.
I use R every day, and I only use a loop for a sequential operation.
I never use a double loop.
Using apply functions should be a part of your regular R experience.
Other versions we’ll talk about have been optimized.
However, you need to know the basics in order to use those.
And you still may need parallel versions.
Note: more detail on much of this part is given in another workshop.
Operators that send what comes before to what comes after.
There are many different pipes.
There are many packages that use their own.
However, the vast majority of packages use the same pipe: %>%, from the magrittr package.
Here, we’ll focus on their use with the dplyr package.
Later, we’ll use it for visualizations.
Example:
mydf %>%
  select(var1, var2) %>%
  filter(var1 == 'Yes') %>%
  summary
Start with a data.frame, select columns from it, filter/subset it, get a summary.
We can use variables as soon as they are created.
mydf %>%
  mutate(newvar1 = var1 + var2,
         newvar2 = newvar1/var3) %>%
  summarise(newvar2avg = mean(newvar2))
Generic example:
basegraph %>%
  points %>%
  lines %>%
  layout
Sometimes you’ll want to use functions in a way in which they won’t be aware of the piped object or the columns in it.
Example: pipe to a modeling function
mydf %>%
  lm(y ~ x)  # error
While other pipes can do this (e.g. %$% in magrittr), one can use a dot.
mydf %>%
  lm(y ~ x, data=.)
Piping is not just for data.frames.
c('Ceci', "n'est", 'pas', 'une', 'pipe!') %>%
{
.. <- . %>%
if (length(.) == 1) .
else paste(.[1], '%>%', ..(.[-1]))
..(.)
}
[1] "Ceci %>% n'est %>% pas %>% une %>% pipe!"
Pipes are best used interactively.
Extremely useful for data exploration.
Common in many visualization packages.
See the magrittr package for more pipes.
The original data management package of the three covered here.
More general than dplyr.
Not as useful for most common operations, but contains:
adply, dlply etc.
library(plyr)
x = list(var1=1:5, var2=2:6)
ldply(x)
   .id V1 V2 V3 V4 V5
1 var1  1  2  3  4  5
2 var2  2  3  4  5  6
ldply(x, sum)
   .id V1
1 var1 15
2 var2 20
Option to parallelize.
*ply: apply style functions, with parallel capability
join_all: Recursively join a list of data frames
rbind.fill: Combine data.frames by row, filling in missing columns.
mapvalues/revalue: replace values
round_any: Round to multiple of any number.
Grammar of data manipulation.
Next iteration of plyr.
Focused on tools for working with data frames.
It has three main goals:
Make the most important data manipulation tasks easier.
Do them faster.
Use the same interface to work with data frames, data tables, or databases.
Some key operations:
select: grab columns
filter/slice: grab rows
group_by: grouped operations
mutate/transmute: create new variables
summarize: summarise/aggregate
do: arbitrary operations
Various join/merge functions.
Little things like:
No need to quote variable names.
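As a minimal sketch of the grouped workflow (assumes the dplyr package is installed; uses the built-in mtcars data):

```r
library(dplyr)  # assumes dplyr is installed

mtcars %>%
  group_by(cyl) %>%                        # one group per cylinder count
  summarize(avg_mpg = mean(mpg), n = n())  # one row per group
```

The result has one row per group, here one row each for 4, 6, and 8 cylinders.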
Let’s say we want to select from our data the following variables: ID, X1 through X10, var1 and var2, and anything starting with XYZ.
How might we go about this?
It’s tedious, typically taking two steps just to get the columns you want.
# numeric indexes; not conducive to readability or reproducibility
newData = oldData[,c(1,2,3,4, etc.)]
# explicitly by name; fine if only a handful; not pretty
newData = oldData[,c('ID','X1', 'X2', etc.)]
# two step with grep; regex difficult to read/understand
cols = c('ID', paste0('X', 1:10), 'var1', 'var2', grep('^XYZ', colnames(oldData), value=T))
newData = oldData[,cols]
# or via subset
newData = subset(oldData, select = cols)
What if you also want observations where Z is Yes, Q is No, and only the observations with the top 50 values of var2, ordered by var1 (descending)?
# three operations and overwriting or creating new objects if we want clarity
newData = newData[oldData$Z == 'Yes' & oldData$Q == 'No',]
newData = newData[order(newData$var2, decreasing=T)[1:50],]
newData = newData[order(newData$var1, decreasing=T),]
And this is for fairly straightforward operations.
newData = oldData %>%
  filter(Z == 'Yes', Q == 'No') %>%
  select(num_range('X', 1:10), contains('var'), starts_with('XYZ')) %>%
  top_n(var2, n=50) %>%
  arrange(desc(var1))
dplyr and piping provide an alternative.
Even though the initial base R approach depicted is fairly concise, it still can be potentially:
tidyr has two primary functions for reshaping data: gather (wide to long) and spread (long to wide).
Other useful functions include:
library(tidyr)
stocks <- data.frame( time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4) )
stocks %>% head
        time          X           Y          Z
1 2009-01-01 -1.1866420  0.61441664 -2.1011634
2 2009-01-02 -1.9764097 -3.48103599  5.8686859
3 2009-01-03  0.3170110 -1.35598840  4.1802379
4 2009-01-04 -0.5648983  1.44583225 -0.7076149
5 2009-01-05  1.0914138  3.81377475 -3.9097850
6 2009-01-06 -0.5586778  0.03054858 -3.9985824
stocks %>% gather(stock, price, -time) %>% head
        time stock      price
1 2009-01-01 X -1.1866420
2 2009-01-02 X -1.9764097
3 2009-01-03 X 0.3170110
4 2009-01-04 X -0.5648983
5 2009-01-05 X 1.0914138
6 2009-01-06 X -0.5586778
I find the dplyr grammar to be clear for a lot of standard data processing.
The best usage of it is for on-the-fly data exploration and visualization.
Drawbacks:
multidplyr
Partitions the data across a cluster.
Faster than data.table (after partitioning)
data.table works in a notably different way than dplyr.
However, you’d use it for the same reasons.
Like dplyr, the data objects are both data.frames and a package-specific class.
Faster subset, grouping, update, ordered joins and list columns
In general, data.table works with brackets as in base R.
However, the brackets work like a function call and have several arguments.
x[i, j, by, keyby, with = TRUE, ...]
Importantly, you can’t use the brackets as you would with data.frames.
library(data.table)
df = data.table(x=rnorm(6), g=1:3, y=runif(6))
df[,4]
[1] 4
Grab rows
df[1:3,]
            x g         y
1: -0.1915347 1 0.9621328
2: 0.5369777 2 0.8398547
3: 0.8996597 3 0.3863851
Grab columns.
df[,y]
[1] 0.9621328 0.8398547 0.3863851 0.7486282 0.6252087 0.2664072
Dropping columns is awkward.
This is because the second part of the data.table object is an argument ‘j’.
df[,-y]
[1] -0.9621328 -0.8398547 -0.3863851 -0.7486282 -0.6252087 -0.2664072
df[,-'y', with=F]
            x g
1: -0.1915347 1
2: 0.5369777 2
3: 0.8996597 3
4: 0.1508541 1
5: -0.8270485 2
6: 0.9119148 3
group-by, with creation of a new variable.
Note that these actually modify df in place.
df1 = df2 = df
df[,sum(x,y), by=g]
   g       V1
1: 1 1.670080
2: 2 1.174993
3: 3 2.464367
df1[,newvar := sum(x,y), by=g]
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2: 0.5369777 2 0.8398547 1.174993
3: 0.8996597 3 0.3863851 2.464367
4: 0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 1.174993
6: 0.9119148 3 0.2664072 2.464367
df1
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2: 0.5369777 2 0.8398547 1.174993
3: 0.8996597 3 0.3863851 2.464367
4: 0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 1.174993
6: 0.9119148 3 0.2664072 2.464367
We can also create groupings on the fly.
df2[,newvar := sum(x,y), by=g==1]
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2: 0.5369777 2 0.8398547 3.639359
3: 0.8996597 3 0.3863851 3.639359
4: 0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 3.639359
6: 0.9119148 3 0.2664072 3.639359
df2
            x g         y   newvar
1: -0.1915347 1 0.9621328 1.670080
2: 0.5369777 2 0.8398547 3.639359
3: 0.8996597 3 0.3863851 3.639359
4: 0.1508541 1 0.7486282 1.670080
5: -0.8270485 2 0.6252087 3.639359
6: 0.9119148 3 0.2664072 3.639359
df1[df2]
The following demonstrates some timings from here.
By the way, never, ever use aggregate. For anything.
fun elapsed
1: aggregate 114.35
2: by 24.51
3: sapply 11.62
4: tapply 11.33
5: dplyr 10.97
6: lapply 10.65
7: data.table 2.71
Ever.
Really.
Chaining data.table operations can be done, but it is awkward at best.
mydf[,newvar:=mean(x),][,newvar2:=sum(newvar), by=group][,-'y', with=FALSE]
mydf[,newvar:=mean(x),
     ][,newvar2:=sum(newvar), by=group
     ][,-'y', with=FALSE
     ]
Probably better to just use a pipe-and-dot approach.
mydf[,newvar:=mean(x),] %>%
  .[,newvar2:=sum(newvar), by=group] %>%
  .[,-'y', with=FALSE]
Faster methods are great to have.
Drawbacks:
If speed and/or memory is (potentially) a concern: data.table.
For interactive exploration: dplyr.
Piping allows one to use both, so there’s no need to choose.
ggplot2 is an extremely popular package for visualization in R.
It entails a grammar of graphics.
Key ideas:
Strengths:
Aesthetics allow one to map data to aesthetic aspects of the plot.
The function used in ggplot to do this is aes
aes(x=myvar, y=myvar2, color=myvar3, group=g)
In general, we start with a base layer and add to it.
In most cases you’ll start as follows.
ggplot(aes(x=myvar, y=myvar2), data=mydata)
This would not produce anything except for a plot background.
Layers are added via piping.
The first layers added are typically geoms:
ggplot2 was using pipes before it was cool, and so it has a different pipe.
Otherwise, the concept is the same as before.
ggplot(aes(x=myvar, y=myvar2), data=mydata) +
  geom_point()
And now we would have a scatterplot.
library(ggplot2)
data("diamonds"); data('economics')
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point()
ggplot(aes(x=date, y=unemploy), data=economics) +
  geom_line()
In the following, one setting is not mapped to the data.
ggplot(aes(x=carat, y=price), data=diamonds) +
  geom_point(aes(size=carat, color=clarity), alpha=.25)
There are many statistical functions built in.
One of the key strengths of ggplot is that you don’t have to do much preprocessing.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_quantile()
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()
ggplot(mtcars, aes(cyl, mpg)) +
  geom_point() +
  stat_summary(fun.data = "mean_cl_boot", colour = "orange", alpha=.75, size = 1)
Facets allow for panelled display, a very common operation.
In general, we often want comparison plots.
facet_grid will produce a grid.
facet_wrap is more flexible.
Both use a formula approach to specify the grouping.
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(vs ~ cyl, labeller = label_both)
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(vs ~ cyl, labeller = label_both, ncol=2)
ggplot2 makes it easy to get good-looking graphs quickly.
However, the amount of fine control is extensive.
ggplot(aes(x=carat, y=price), data=diamonds) +
geom_point(aes(color=clarity), alpha=.5) +
scale_y_log10(breaks=c(1000,5000,10000)) +
xlim(0, 10) +
scale_color_brewer(type='div') +
facet_wrap(~cut, ncol=3) +
theme_minimal() +
theme(axis.ticks.x=element_line(color='darkred'),
axis.text.x=element_text(angle=-45),
axis.text.y=element_text(size=20),
strip.text=element_text(color='forestgreen'),
strip.background=element_blank(),
panel.grid.minor=element_line(color='blue'),
legend.key=element_rect(linetype=4),
legend.position='bottom')
In the last example you saw two uses of a theme.
Each argument takes on a specific value or an element function:
The base theme is not too good.
You will almost invariably need to tweak it.
While many contributed before, ggplot2 now has an extension system.
There is even a website to track the extensions.
Examples include:
ggplot2 is an easy to use, but powerful visualization tool.
Allows one to think in many dimensions for any graph:
2d graphs are not useful for conveying anything but the simplest ideas.
Use ggplot2 to easily go beyond 2d for interesting visualizations.
ggplot2 is the most widely used package for visualization in R.
However, it is not interactive by default.
Many packages use htmlwidgets, d3 (JavaScript library) etc. to provide interactive graphics.
General:
Specific functionality:
One of the advantages to piping is that it’s not limited to dplyr style data management functions.
Any R function can be potentially piped to, and we’ve seen several examples so far.
This facilitates data exploration, especially visually.
Many newer visualization packages take advantage of piping as well.
htmlwidgets is a package that makes it easy to use R to create javascript visualizations.
The packages using it typically are pipe-oriented and produce interactive plots.
A couple demonstrations with plotly.
Note the layering as with ggplot2.
Piping used before plotting.
library(plotly)
midwest %>%
  filter(inmetro==T) %>%
  plot_ly(x=percollege, y=percbelowpoverty, mode='markers')
Plotly has modes, which allow for points, lines, text, and combinations.
Traces work similarly to geoms.
library(mgcv)
mtcars %>%
  mutate(amFactor = factor(am, labels=c('auto', 'manual')),
         hovertext = paste(wt, mpg, amFactor),
         prediction = predict(gam(mpg~s(wt), data=mtcars))) %>%
  arrange(wt) %>%
  plot_ly(x=wt, y=mpg, color=amFactor, width=800, height=500, mode='markers') %>%
  add_trace(x=wt, y=prediction, alpha=.5, hover=hovertext, name='gam prediction')
The nice thing about plotly is that we can feed a ggplot to it.
It would have been easier to use geom_smooth, so let’s do so.
gp = mtcars %>%
mutate(amFactor = factor(am, labels=c('auto', 'manual')),
hovertext = paste(wt, mpg, amFactor),
prediction = predict(gam(mpg~s(wt), data=mtcars))) %>%
arrange(wt) %>%
ggplot(aes(x=wt, y=mpg)) +
geom_smooth() +
geom_point(aes(color=amFactor))
ggplotly(gp)
Dygraphs are useful for time series.
library(dygraphs)
data(UKLungDeaths)
cbind(ldeaths, mdeaths, fdeaths) %>%
  dygraph(width=800) %>%
  dyOptions(stackedGraph = TRUE, colors=RColorBrewer::brewer.pal(3, name='Dark2')) %>%
  dyRangeSelector(height = 20)
library(visNetwork)
visNetwork(nodes, edges, height=600, width=800) %>%
  visNodes(shape='circle',
           font=list(),
           scaling=list(min=10, max=50, label=list(enable=T))) %>%
  visLegend()
library(DT)
movies %>%
  select(1:6) %>%
  filter(rating > 9) %>%
  slice(sample(1:nrow(.), 50)) %>%
  datatable(rownames=F)
Shiny is a framework that essentially allows you to build an interactive website.
Most of the more recently developed visualization packages will work specifically within the shiny and rmarkdown settings.
Interactivity allows for even more dimensions to be brought to a graphic.
Interactive graphics are more fun too!
Just a couple visualization packages can go a very long way.